An Analysis of the Correlation Between NBA Statistics
The purpose of this study is to analyze the National Basketball Association (NBA) and how different statistics, both individual and team, affect each other. In addition to basic statistics such as points and assists, I will also study advanced statistics that have become more significant as NBA teams have become more data-oriented. These statistics guide NBA teams and players daily, so a better understanding of their impact offers insight into why teams and players make the decisions they do.
I will be using four main datasets throughout this presentation (some of which were combined from additional datasets to add variables such as player countries, heights, and weights). The player statistics datasets are based on the 2020-21 NBA season. They contain data for 481 NBA players and include per-game and advanced stats. The last dataset contains team statistics for the 2020-21 season.
This presentation aims to answer the following questions:
- What positions tend to have better stats in each of the main statistical categories?
- How do height and weight affect offensive and defensive stats (for example OBPM, DBPM, or basic per-game stats), if at all?
- How does a team’s payroll affect its overall performance?
Rows: 481
Columns: 33
$ Player <chr> "Precious Achiuwa", "Jaylen Adams", "Steven Adams", "Bam Adeba…
$ Pos <fct> PF, PG, C, C, C, SG, SG, SG, C, PF, PF, PF, PF, SF, PF, PG, SF…
$ Age <dbl> 21, 24, 27, 23, 35, 22, 22, 25, 22, 30, 27, 26, 23, 28, 36, 20…
$ Tm <chr> "MIA", "MIL", "NOP", "MIA", "SAS", "PHO", "NOP", "MEM", "TOT",…
$ G <dbl> 28, 6, 27, 26, 18, 3, 23, 19, 28, 2, 24, 27, 1, 18, 27, 25, 18…
$ GS <dbl> 2, 0, 27, 26, 18, 0, 3, 8, 10, 0, 24, 27, 0, 0, 3, 17, 18, 0, …
$ MP <dbl> 14.6, 2.8, 28.1, 33.6, 26.7, 2.7, 19.2, 23.9, 26.2, 8.0, 28.1,…
$ FG <dbl> 2.6, 0.2, 3.5, 7.4, 5.9, 0.0, 3.3, 3.2, 4.4, 0.5, 5.0, 10.3, 0…
$ FGA <dbl> 4.4, 1.3, 5.8, 12.9, 12.5, 1.0, 8.2, 7.4, 6.8, 0.5, 10.7, 18.4…
$ `FG%` <dbl> 0.590, 0.125, 0.603, 0.573, 0.476, 0.000, 0.410, 0.429, 0.642,…
$ `3P` <dbl> 0.0, 0.0, 0.0, 0.1, 1.3, 0.0, 1.0, 2.3, 0.0, 0.5, 1.7, 1.1, 0.…
$ `3PA` <dbl> 0.0, 0.3, 0.0, 0.2, 3.7, 0.3, 3.8, 5.3, 0.1, 0.5, 4.3, 4.0, 0.…
$ `3P%` <dbl> 0.000, 0.000, 0.000, 0.400, 0.358, 0.000, 0.276, 0.436, 0.250,…
$ `2P` <dbl> 2.6, 0.2, 3.5, 7.3, 4.6, 0.0, 2.3, 0.8, 4.3, 0.0, 3.3, 9.2, 0.…
$ `2PA` <dbl> 4.4, 1.0, 5.7, 12.7, 8.8, 0.7, 4.4, 2.1, 6.6, 0.0, 6.4, 14.4, …
$ `2P%` <dbl> 0.590, 0.167, 0.606, 0.576, 0.525, 0.000, 0.525, 0.410, 0.651,…
$ `eFG%` <dbl> 0.590, 0.125, 0.603, 0.576, 0.529, 0.000, 0.473, 0.586, 0.645,…
$ FT <dbl> 1.3, 0.0, 1.1, 5.1, 0.9, 0.0, 1.1, 1.7, 3.6, 0.0, 2.1, 6.4, 0.…
$ FTA <dbl> 2.4, 0.0, 2.3, 6.0, 1.2, 0.0, 1.4, 1.9, 4.7, 1.0, 2.7, 9.9, 0.…
$ `FT%` <dbl> 0.561, 0.000, 0.468, 0.841, 0.762, 0.000, 0.781, 0.892, 0.758,…
$ ORB <dbl> 1.3, 0.0, 4.3, 1.9, 0.8, 0.0, 0.2, 0.4, 2.9, 0.5, 0.9, 1.7, 0.…
$ DRB <dbl> 2.7, 0.5, 4.6, 7.3, 3.5, 0.3, 2.4, 2.5, 6.1, 1.5, 5.3, 9.7, 4.…
$ TRB <dbl> 4.0, 0.5, 8.9, 9.2, 4.3, 0.3, 2.7, 2.9, 9.0, 2.0, 6.3, 11.4, 4…
$ AST <dbl> 0.6, 0.3, 2.1, 5.3, 1.9, 0.3, 2.0, 2.1, 1.6, 1.0, 3.8, 5.8, 0.…
$ STL <dbl> 0.4, 0.0, 1.0, 1.0, 0.4, 0.0, 1.1, 1.0, 0.5, 0.0, 1.1, 1.3, 0.…
$ BLK <dbl> 0.5, 0.0, 0.6, 1.0, 0.9, 0.0, 0.3, 0.2, 1.6, 0.0, 0.8, 1.3, 2.…
$ TOV <dbl> 1.0, 0.0, 1.7, 3.0, 0.9, 0.0, 1.3, 1.1, 1.5, 1.0, 1.4, 3.7, 2.…
$ PF <dbl> 1.9, 0.2, 1.9, 2.6, 1.5, 0.3, 1.7, 1.3, 1.6, 0.0, 1.8, 3.1, 1.…
$ PTS <dbl> 6.5, 0.3, 8.0, 19.9, 14.1, 0.0, 8.8, 10.4, 12.3, 1.5, 13.8, 28…
$ COUNTRY <chr> "Nigeria", NA, "New Zealand", "USA", "USA", NA, "Canada", "USA…
$ salary <int> 2582160, 449115, 29592695, 5115492, 17628340, 449115, 3113160,…
$ height <dbl> 80, 74, 84, 82, 83, 75, 77, 77, 83, 81, 81, 81, 82, 79, 80, 74…
$ weight <dbl> 225, 190, 255, 255, 245, 195, 205, 198, 234, 215, 230, 205, 20…
Rows: 481
Columns: 28
$ Player <chr> "Precious Achiuwa", "Jaylen Adams", "Steven Adams", "Bam Adeba…
$ Pos <chr> "PF", "PG", "C", "C", "C", "SG", "SG", "SG", "C", "PF", "PF", …
$ Age <dbl> 21, 24, 27, 23, 35, 22, 22, 25, 22, 30, 27, 26, 23, 28, 36, 20…
$ Tm <chr> "MIA", "MIL", "NOP", "MIA", "SAS", "PHO", "NOP", "MEM", "TOT",…
$ G <dbl> 28, 6, 27, 26, 18, 3, 23, 19, 28, 2, 24, 27, 1, 18, 27, 25, 18…
$ MP <dbl> 408, 17, 760, 873, 480, 8, 441, 454, 734, 16, 675, 906, 8, 149…
$ PER <dbl> 15.1, -6.9, 15.9, 22.7, 15.2, -11.9, 12.0, 14.0, 22.5, 7.5, 17…
$ `TS%` <dbl> 0.599, 0.125, 0.592, 0.641, 0.542, 0.000, 0.502, 0.630, 0.695,…
$ `3PAr` <dbl> 0.000, 0.250, 0.006, 0.015, 0.298, 0.333, 0.463, 0.721, 0.021,…
$ FTr <dbl> 0.541, 0.000, 0.397, 0.469, 0.093, 0.000, 0.170, 0.264, 0.695,…
$ `ORB%` <dbl> 10.5, 0.0, 16.9, 6.8, 3.2, 0.0, 1.3, 1.7, 12.6, 6.2, 3.5, 5.6,…
$ `DRB%` <dbl> 19.8, 18.2, 18.0, 23.2, 14.0, 13.6, 14.1, 12.0, 25.5, 20.3, 21…
$ `TRB%` <dbl> 15.4, 9.4, 17.5, 15.4, 8.4, 6.9, 7.7, 6.7, 19.1, 13.0, 12.2, 1…
$ `AST%` <dbl> 6.8, 13.4, 10.1, 27.9, 11.4, 14.7, 14.9, 11.5, 9.0, 16.7, 19.4…
$ `STL%` <dbl> 1.4, 0.0, 1.7, 1.4, 0.7, 0.0, 2.8, 2.0, 0.9, 0.0, 1.9, 1.8, 0.…
$ `BLK%` <dbl> 3.8, 0.0, 2.0, 3.2, 2.8, 0.0, 1.9, 0.6, 5.5, 0.0, 2.5, 3.5, 21…
$ `TOV%` <dbl> 16.1, 0.0, 20.1, 16.2, 6.4, 0.0, 12.9, 11.3, 14.8, 51.5, 10.7,…
$ `USG%` <dbl> 19.7, 19.7, 12.8, 24.6, 22.3, 16.8, 22.4, 16.5, 17.1, 10.3, 20…
$ OWS <dbl> 0.3, -0.1, 1.2, 2.3, 0.2, -0.1, -0.2, 0.7, 2.3, 0.0, 1.1, 2.7,…
$ DWS <dbl> 0.6, 0.0, 0.5, 1.3, 0.5, 0.0, 0.4, 0.4, 0.8, 0.0, 0.8, 1.5, 0.…
$ WS <dbl> 0.9, -0.1, 1.7, 3.6, 0.7, -0.1, 0.2, 1.1, 3.1, 0.0, 1.9, 4.3, …
$ `WS/48` <dbl> 0.101, -0.265, 0.109, 0.196, 0.075, -0.327, 0.025, 0.113, 0.20…
$ OBPM <dbl> -2.8, -15.6, -0.1, 2.9, 0.3, -16.4, -2.6, 0.4, 2.3, -3.4, 1.9,…
$ DBPM <dbl> -0.2, -5.2, -1.0, 2.0, -1.0, -4.8, 0.1, 0.1, 0.4, 0.1, 1.1, 2.…
$ BPM <dbl> -3.0, -20.9, -1.1, 4.9, -0.7, -21.2, -2.5, 0.5, 2.7, -3.3, 2.9…
$ VORP <dbl> -0.1, -0.1, 0.2, 1.5, 0.2, 0.0, -0.1, 0.3, 0.9, 0.0, 0.8, 2.1,…
$ salary <int> 2582160, 449115, 29592695, 5115492, 17628340, 449115, 3113160,…
$ MPG <dbl> 14.6, 2.8, 28.1, 33.6, 26.7, 2.7, 19.2, 23.9, 26.2, 8.0, 28.1,…
Rows: 30
Columns: 28
$ team <chr> "Phoenix Suns", "Golden State Warriors", "Memphis Grizzlies", …
$ GP <dbl> 52, 53, 55, 54, 53, 55, 54, 53, 53, 54, 51, 53, 53, 55, 53, 54…
$ W <dbl> 42, 40, 37, 34, 33, 34, 33, 32, 32, 31, 28, 29, 29, 30, 28, 28…
$ L <dbl> 10, 13, 18, 20, 20, 21, 21, 21, 21, 23, 23, 24, 24, 25, 25, 26…
$ `WIN%` <dbl> 0.808, 0.755, 0.673, 0.630, 0.623, 0.618, 0.611, 0.604, 0.604,…
$ MIN <dbl> 48.1, 48.2, 48.3, 48.5, 48.1, 48.2, 48.0, 48.4, 48.0, 48.3, 48…
$ PTS <dbl> 112.7, 110.9, 112.7, 108.7, 111.6, 112.7, 106.5, 107.8, 113.6,…
$ FGM <dbl> 42.7, 40.4, 42.7, 39.3, 41.6, 40.7, 39.5, 39.6, 40.6, 39.1, 40…
$ FGA <dbl> 89.4, 86.5, 93.4, 85.7, 87.0, 88.9, 85.1, 85.1, 85.9, 86.4, 91…
$ `FG%` <dbl> 47.8, 46.7, 45.7, 45.9, 47.8, 45.8, 46.4, 46.6, 47.3, 45.3, 44…
$ `3PM` <dbl> 11.5, 14.6, 11.1, 13.5, 11.2, 14.3, 11.8, 11.0, 14.6, 12.3, 12…
$ `3PA` <dbl> 31.7, 40.1, 32.7, 36.1, 30.0, 39.4, 33.7, 30.9, 40.0, 36.8, 34…
$ `3P%` <dbl> 36.3, 36.4, 33.9, 37.5, 37.2, 36.4, 35.1, 35.8, 36.4, 33.5, 35…
$ FTM <dbl> 15.8, 15.5, 16.2, 16.5, 17.2, 16.9, 15.7, 17.5, 17.8, 15.6, 15…
$ FTA <dbl> 20.0, 20.3, 22.0, 20.2, 21.2, 21.6, 20.9, 21.7, 22.9, 20.2, 20…
$ `FT%` <dbl> 79.1, 76.4, 73.7, 81.5, 81.4, 78.2, 75.1, 80.9, 77.8, 77.0, 75…
$ OREB <dbl> 10.2, 10.1, 13.6, 10.8, 8.9, 10.3, 10.4, 8.4, 10.1, 9.5, 13.2,…
$ DREB <dbl> 35.9, 36.4, 35.0, 33.8, 34.1, 36.5, 34.9, 33.7, 35.7, 34.3, 31…
$ REB <dbl> 46.1, 46.5, 48.6, 44.6, 43.0, 46.8, 45.3, 42.1, 45.8, 43.8, 45…
$ AST <dbl> 26.5, 27.5, 25.1, 25.9, 24.5, 23.4, 25.5, 23.2, 22.2, 24.0, 22…
$ TOV <dbl> 13.3, 15.6, 13.3, 14.9, 13.0, 13.7, 14.9, 12.5, 14.3, 12.6, 12…
$ STL <dbl> 8.6, 9.4, 10.1, 7.6, 7.2, 7.7, 7.2, 7.6, 7.1, 7.1, 9.2, 7.0, 7…
$ BLK <dbl> 4.3, 4.9, 6.4, 3.3, 4.6, 4.2, 4.3, 5.7, 4.8, 4.1, 4.9, 5.5, 3.…
$ BLKA <dbl> 4.0, 4.1, 6.4, 4.4, 5.2, 4.5, 4.5, 4.6, 4.2, 3.9, 5.1, 5.2, 4.…
$ PF <dbl> 19.3, 20.3, 19.1, 20.5, 18.8, 17.8, 17.0, 19.1, 18.8, 19.7, 19…
$ PFD <dbl> 19.3, 17.7, 19.0, 20.0, 17.8, 19.2, 19.2, 18.9, 20.1, 19.9, 18…
$ `+/-` <dbl> 7.8, 8.3, 4.1, 4.2, 1.7, 4.0, 4.4, 2.2, 6.0, 2.7, 1.3, 0.5, 1.…
$ payroll <int> 128858241, 171105334, 132022601, 134731235, 128963580, 1366239…
In this heatmap, I have filtered out all of the players from the US so that we can better see the number of players from other countries.
Players averaging 10 or fewer points per game were removed because every height had many entries with low scoring averages.
Only players averaging more than 10 points per game are included, so that players who rarely play do not pull the average salaries down.
I restricted this analysis to the players with the most win shares (WS) because WS ranks overall play and accounts for minutes played, which filters out players who rarely play.
One major limitation of this study is that the best dataset I could find covers the 2020-21 NBA season. I could have used other NBA datasets, but this was the only one that included the advanced stats many analysts rely on today. While the data is recent enough to provide insight into today’s NBA, a second limitation is that the dataset was created in the middle of the 2020-21 season, so the player statistics cover only part of a season.
---
title: "NBA Statistical Analysis"
output:
  flexdashboard::flex_dashboard:
    theme:
      bootswatch: materia
      primary: "#F54242"
      secondary: "#2196f3"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
<style>
.chart-title { /* chart_title */
font-size: 20px;
}
body{ /* Normal */
font-size: 16px;
}
</style>
```{css color tabs}
/* Set font color of inactive tab to black */
.nav-tabs-custom .nav-tabs > li > a
{
  color: black;
}
/* Set font color of active tab to blue */
.nav-tabs-custom .nav-tabs > li.active > a
{
color: #2196f3;
}
/* To set color on hover */
.nav-tabs-custom .nav-tabs > li.active > a:hover
{
color: grey;
}
/* Let the sidebar scroll when its content overflows */
.sidebar
{
  overflow: auto;
}
```
```{r setup, include=FALSE}
library(flexdashboard)
```
```{r data-packages}
library(pacman)
p_load(tidyverse, maps, viridis, plotly, DT, gridExtra)
nba_advanced <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/nba2021_advanced.csv")
nba_advanced <- nba_advanced[!duplicated(nba_advanced$Player), ]
nba_per_game <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/nba2021_per_game.csv")
nba_per_game <- nba_per_game %>%
  # collapse hybrid position labels into the five standard positions
  mutate(Pos = recode(Pos, 'F-C' = 'C', 'SF-PF' = 'SF', 'G' = 'PG', 'F' = 'PF'))
nba_per_game$Pos <- factor(nba_per_game$Pos,
levels = c("PG", "SG", "SF", "PF", "C"))
nba_team_stats <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/nba_team_stats_00_to_21.csv")
nba_team_stats <- nba_team_stats %>%
filter(SEASON == "2020-21") %>%
rename("team" = "TEAM")
payroll <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/NBA Payroll(1990-2023).csv")
payroll <- payroll %>%
filter(seasonStartYear == 2020) %>%
subset(select = c("team", "payroll"))
# convert payroll to int
payroll$payroll <-gsub("[^0-9.]", "", payroll$payroll)
payroll$payroll <- as.integer(payroll$payroll)
salaries <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/NBA Salaries(1990-2023).csv")
salaries <- salaries %>%
filter(seasonStartYear == 2020) %>%
rename("Player" = "playerName") %>%
subset(select = c("Player", "salary"))
# convert salary to int
salaries$salary <- gsub("[^0-9.]", "", salaries$salary)
salaries$salary <- as.integer(salaries$salary)
country <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/nba_all_teams.csv")
country <- country %>%
rename("Player" = "Player Name") %>%
subset(select = c("Player", "COUNTRY"))
height_and_weight <- read_csv("/Users/christopherbussen/Documents/School/UDS2023/MTH209/finalProject/all_seasons.csv")
height_and_weight <- height_and_weight %>%
rename("Player" = "player_name",
"height" = "player_height",
"weight" = "player_weight") %>%
subset(select = c("Player", "height", "weight"))
# convert cm to inches; round so floating-point noise cannot shift a height across a group boundary
height_and_weight$height <- round(height_and_weight$height / 2.54, 1)
# convert to lbs
height_and_weight$weight <- height_and_weight$weight * 2.20462
height_and_weight$weight <- round(height_and_weight$weight, 0)
# add country to dataset
nba_per_game <- nba_per_game %>%
left_join(country, by = "Player")
# add salary to dataset
nba_per_game <- nba_per_game %>%
left_join(salaries, by = "Player")
nba_advanced <- nba_advanced %>%
left_join(salaries, by = "Player")
# create mpg for nba advanced
nba_advanced <- nba_advanced %>%
mutate(MPG = MP / G)
nba_advanced$MPG <- round(nba_advanced$MPG, 1)
# add height and weight to dataset and get rid of duplicate players
nba_per_game <- nba_per_game %>%
left_join(height_and_weight, by = "Player")
nba_per_game <- nba_per_game[!duplicated(nba_per_game$Player), ]
nba_team_stats <- nba_team_stats %>%
left_join(payroll, by = "team")
nba_team_stats <- select(nba_team_stats,-teamstatspk, -SEASON)
addCountry <- filter(nba_per_game, is.na(COUNTRY))
```
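The join-then-deduplicate step above can be illustrated on a small example. The sketch below assumes that the height and weight lookup table can list a player more than once (the file name `all_seasons.csv` suggests one row per season, but that is an assumption). All names and numbers are hypothetical toy values, and base R's `merge()` stands in for `left_join()` so the example runs without extra packages.

```r
# Toy data: every name and value here is hypothetical.
stats <- data.frame(Player = c("Player A", "Player B"), PTS = c(12.3, 4.5))
sizes <- data.frame(
  Player = c("Player A", "Player A", "Player B"),
  height = c(78, 78, 81),
  weight = c(210, 215, 240)
)

# all.x = TRUE keeps every row of `stats`, like a left join.
joined <- merge(stats, sizes, by = "Player", all.x = TRUE)
nrow(joined)   # "Player A" now appears twice

# Keep only the first row seen for each player.
deduped <- joined[!duplicated(joined$Player), ]
nrow(deduped)
```

Which duplicate survives depends on row order, so sorting the lookup table first (for example by season) would make that choice explicit.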
Introduction
===
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Basic Info
<font size = 5>
**An Analysis of the Correlation Between NBA Statistics**
</font>
The purpose of this study is to analyze the National Basketball Association (NBA) and how different statistics, both individual and team, affect each other. In addition to basic statistics such as points and assists, I will also study advanced statistics that have become more significant as NBA teams have become more data-oriented. These statistics guide NBA teams and players daily, so a better understanding of their impact offers insight into why teams and players make the decisions they do.
I will be using four main datasets throughout this presentation (some of which were combined from additional datasets to add variables such as player countries, heights, and weights). The player statistics datasets are based on the 2020-21 NBA season. They contain data for 481 NBA players and include per-game and advanced stats. The last dataset contains team statistics for the 2020-21 season.
This presentation aims to answer the following questions:
- What positions tend to have better stats in each of the main statistical categories?
- How do height and weight affect offensive and defensive stats (for example OBPM, DBPM, or basic per-game stats), if at all?
- How does a team’s payroll affect its overall performance?
### Glimpse of Per Game
```{r}
glimpse(nba_per_game)
```
### Glimpse of Advanced
```{r}
glimpse(nba_advanced)
```
### Glimpse of Team Stats
```{r}
glimpse(nba_team_stats)
```
Column {data-width=350}
-----------------------------------------------------------------------
### Explanation of Variables
Player Overview
===
Column {.tabset}
-----
### Per Game Table
```{r pg table}
DT::datatable(nba_per_game[,1:32], rownames = FALSE,
options = list(columnDefs = list(list(className = 'dt-center', targets = 1:31))))
```
### Advanced Table
```{r advanced table}
DT::datatable(nba_advanced[,1:26], rownames = FALSE,
options = list(columnDefs = list(list(className = 'dt-center', targets = 1:25))))
```
### Team Stats Table
```{r team table}
DT::datatable(nba_team_stats[,1:28], rownames = FALSE,
options = list(columnDefs = list(list(className = 'dt-center', targets = 1:27))))
```
### Birthplaces
```{r world map 1, echo=FALSE}
world <- map_data("world")
count <- nba_per_game %>%
group_by(COUNTRY) %>%
summarize(count = n())
birthplaces <- count %>%
left_join(world, by = c("COUNTRY" = "region"))
# need to use map and another first geom_polygon to plot the world map by itself
# this way map still shows up in areas where there are no players
p1 <- world %>%
ggplot() +
geom_polygon(aes(x=long, y=lat, group=group, text = region), fill = "grey", alpha=0.5) +
geom_polygon(data = birthplaces, aes(x=long, y=lat, group=group, fill = count, text = paste0(COUNTRY, ":\n", count, " NBA Player(s)"))) +
scale_fill_viridis_c(option = "H") +
theme_void() +
labs(title = "NBA Players Birthplaces")
ggplotly(p1, tooltip = "text")
```
### Birthplaces (filtered)
In this heatmap, I have filtered out all of the players from the US so that we can better see the number of players from other countries.
```{r world map 2}
birthplaces <- filter(birthplaces, COUNTRY != "USA")
p2 <- world %>%
ggplot() +
geom_polygon(aes(x = long, y = lat, group = group, text = region), fill = "grey", alpha=0.5) +
geom_polygon(data = birthplaces, aes(x = long, y = lat, group = group, fill = count, text = paste0(COUNTRY, ":\n", count, " NBA Player(s)"))) +
scale_fill_viridis_c(option = "H") +
theme_void() +
  labs(title = "NBA Players Birthplaces (Excluding USA)")
ggplotly(p2, tooltip = "text")
```
Height
===
Column {.tabset data-width=850 .no-padding}
-----
### Points
```{r}
# create height groups; cut() uses right-closed intervals, so the last group holds heights above 83 in. (6'11")
nba_per_game$height_group <- cut(nba_per_game$height, breaks = c(66, 75, 79, 83, Inf), labels = c("<6'4", "6'4-6'7", "6'8-6'11", ">6'11"))
over10ppg <- filter(nba_per_game, PTS > 10)
ptsH <- ggplot(over10ppg, aes(x = height, y = PTS)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 85, by=3), limits = c(70, 85)) +
labs(title="Distribution of PPG Based on Height", x="Height (in.)", y="Points") +
theme(plot.title = element_text(hjust = 0.5))
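# drop only rows with no height group; na.omit() would also drop players missing country or salary
ptsHGroup <- ggplot(drop_na(over10ppg, height_group), aes(x = height_group, y = PTS)) +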
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(ptsH, ptsHGroup, ncol = 2, widths = c(1.9, 1))
```
Players averaging 10 or fewer points per game were removed because every height had many entries with low scoring averages.
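To see how the height groups are assigned, the sketch below runs `cut()` with the same breaks on a few boundary heights. The labels are left out so that R prints the intervals themselves; the test heights are illustrative values, not rows from the dataset.

```r
# Same breaks as the dashboard's height groups (heights in inches).
breaks <- c(66, 75, 79, 83, Inf)

# Heights sitting on or just past each boundary.
heights <- c(75, 76, 79, 80, 83, 84)

# With the default right = TRUE, each interval includes its upper bound.
groups <- cut(heights, breaks = breaks)
data.frame(heights, groups)
```

A player listed at exactly 83 inches (6'11") therefore lands in the third group, and only taller players reach the last one.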
### Assists
```{r}
over2ast <- filter(nba_per_game, nba_per_game$AST > 2)
astsH <- ggplot(over2ast, aes(x = height, y = AST)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 85, by=3), limits = c(70, 85)) +
labs(title="Distribution of APG Based on Height", x="Height (in.)", y="Assists") +
theme(plot.title = element_text(hjust = 0.5))
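# drop only rows with no height group; na.omit() would also drop players missing country or salary
astsHGroup <- ggplot(drop_na(over2ast, height_group), aes(x = height_group, y = AST)) +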
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(astsH, astsHGroup, ncol = 2, widths = c(1.9, 1))
```
### Rebounds
```{r}
over3rb <- filter(nba_per_game, nba_per_game$TRB > 3)
rbsH <- ggplot(over3rb, aes(x = height, y = TRB)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 90, by=4), limits = c(70, 90)) +
labs(title="Distribution of RPG Based on Height", x="Height (in.)", y="Rebounds") +
theme(plot.title = element_text(hjust = 0.5))
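# drop only rows with no height group; na.omit() would also drop players missing country or salary
rbsHGroup <- ggplot(drop_na(over3rb, height_group), aes(x = height_group, y = TRB)) +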
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(rbsH, rbsHGroup, ncol = 2, widths = c(1.9, 1))
```
### Blocks
```{r}
blkH <- ggplot(nba_per_game, aes(x = height, y = BLK)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 90, by=4), limits = c(70, 90)) +
labs(title="Distribution of BPG Based on Height", x="Height (in.)", y="Blocks") +
theme(plot.title = element_text(hjust = 0.5))
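# drop only rows with no height group; na.omit() would also drop players missing country or salary
blkHGroup <- ggplot(drop_na(nba_per_game, height_group), aes(x = height_group, y = BLK)) +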
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(blkH, blkHGroup, ncol = 2, widths = c(1.9, 1))
```
### Steals
```{r}
stlH <- ggplot(nba_per_game, aes(x = height, y = STL)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 90, by=4), limits = c(70, 90)) +
  labs(title="Distribution of SPG Based on Height", x="Height (in.)", y="Steals") +
theme(plot.title = element_text(hjust = 0.5))
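# drop only rows with no height group; na.omit() would also drop players missing country or salary
stlHGroup <- ggplot(drop_na(nba_per_game, height_group), aes(x = height_group, y = STL)) +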
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(stlH, stlHGroup, ncol = 2, widths = c(1.9, 1))
```
### FG%
```{r}
fgH <- ggplot(nba_per_game, aes(x = height, y = `FG%`)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(70, 90, by=4), limits = c(70, 90)) +
labs(title="Distribution of FG% Based on Height", x="Height (in.)", y="FG%") +
theme(plot.title = element_text(hjust = 0.5))
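# drop only rows with no height group or FG%; na.omit() would also drop players missing country or salary
fgHGroup <- ggplot(drop_na(nba_per_game, height_group, `FG%`), aes(x = height_group, y = `FG%`)) +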
geom_boxplot(fill = "#2196f3") +
labs(title ="", x="Height Group", y = NULL) +
theme(text = element_text(size = 10))
grid.arrange(fgH, fgHGroup, ncol = 2, widths = c(1.9, 1))
```
Column
-----------------------------------------------------------------------
### Height Analysis
Weight
===
Column {.tabset data-width=650}
-----
### Points
```{r}
ggplot(over10ppg, aes(x = weight, y = PTS)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(170, 300, by=25), limits = c(170, 300)) +
labs(title="Distribution of PPG Based on Weight", x="Weight (lbs.)", y="Points") +
theme(plot.title = element_text(hjust = 0.5))
```
### Assists
```{r}
ggplot(over2ast, aes(x = weight, y = AST)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(170, 300, by=25), limits = c(170, 300)) +
labs(title="Distribution of APG Based on Weight", x="Weight (lbs.)", y="Assists") +
theme(plot.title = element_text(hjust = 0.5))
```
### Rebounds
```{r}
ggplot(over3rb, aes(x = weight, y = TRB)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(170, 300, by=25), limits = c(170, 300)) +
labs(title="Distribution of RPG Based on Weight", x="Weight (lbs.)", y="Rebounds") +
theme(plot.title = element_text(hjust = 0.5))
```
### Blocks
```{r}
ggplot(nba_per_game, aes(x = weight, y = BLK)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(160, 315, by=25), limits = c(160, 315)) +
labs(title="Distribution of BPG Based on Weight", x="Weight (lbs.)", y="Blocks") +
theme(plot.title = element_text(hjust = 0.5))
```
### Steals
```{r}
ggplot(nba_per_game, aes(x = weight, y = STL)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(160, 315, by=25), limits = c(160, 315)) +
  labs(title="Distribution of SPG Based on Weight", x="Weight (lbs.)", y="Steals") +
theme(plot.title = element_text(hjust = 0.5))
```
### FG%
```{r}
ggplot(nba_per_game, aes(x = weight, y = `FG%`)) +
geom_point(col = "#2196f3") +
scale_x_continuous(breaks = seq(160, 315, by=25), limits = c(160, 315)) +
labs(title="Distribution of FG% Based on Weight", x="Weight (lbs.)", y="FG%") +
theme(plot.title = element_text(hjust = 0.5))
```
Column
---
### Weight Analysis
Position Analysis
===
Column {.tabset data-width=650}
---
### Points
```{r}
ggplot(nba_per_game, aes(x = Pos, y = PTS)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 35, by=5), limits = c(0, 35)) +
labs(title="Effect of Position on PPG", x="Position", y="Points") +
theme(plot.title = element_text(hjust = 0.5))
```
### Assists
```{r}
ggplot(nba_per_game, aes(x = Pos, y = AST)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 12, by=2), limits = c(0, 12)) +
  labs(title="Effect of Position on APG", x="Position", y="Assists") +
theme(plot.title = element_text(hjust = 0.5))
```
### Rebounds
```{r}
ggplot(nba_per_game, aes(x = Pos, y = TRB)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 15, by=3), limits = c(0, 15)) +
labs(title="Effect of Position on RPG", x="Position", y="Rebounds") +
theme(plot.title = element_text(hjust = 0.5))
```
### Steals
```{r}
ggplot(nba_per_game, aes(x = Pos, y = STL)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 2, by=.5), limits = c(0, 2)) +
  labs(title="Effect of Position on SPG", x="Position", y="Steals") +
theme(plot.title = element_text(hjust = 0.5))
```
### Blocks
```{r}
ggplot(nba_per_game, aes(x = Pos, y = BLK)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 3.5, by=.5), limits = c(0, 3.5)) +
labs(title="Effect of Position on BPG", x="Position", y="Blocks") +
theme(plot.title = element_text(hjust = 0.5))
```
### FG%
```{r}
ggplot(nba_per_game, aes(x = Pos, y = `FG%`)) +
geom_boxplot(fill = "#2196f3") +
scale_y_continuous(breaks = seq(0, 1, by=.25), limits = c(0, 1)) +
labs(title="Effect of Position on FG%", x="Position", y="FG%") +
theme(plot.title = element_text(hjust = 0.5))
```
### Avg. Salary
```{r avg salary ~ pos}
avgPosSalary <- over10ppg %>%
group_by(Pos) %>%
summarise(
    avgSalary = mean(salary, na.rm = TRUE)
)
avgSalaries <- ggplot(avgPosSalary, aes(x = Pos, y = avgSalary)) +
geom_col(fill = "#2196f3", aes(text = paste0("Position: ", Pos, "\nAverage Salary: $", round(avgSalary,0)))) +
labs(title="Average Salary by Position", x="Position", y="Average Salary ($)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = seq(0, 20000000, by=4000000), limits = c(0, 20000000), labels = scales::comma)
ggplotly(avgSalaries, tooltip = "text")
```
Only players averaging more than 10 points per game are included, so that players who rarely play do not pull the average salaries down.
Column
---
### Analysis
Team Analysis
===
Column {.tabset data-width=650}
---
### Payroll vs. Wins
```{r payroll scatter}
# named payrollPlot so the plot does not overwrite the payroll data frame
payrollPlot <- ggplot(nba_team_stats, aes(x = payroll, y = `WIN%`)) +
  geom_point(col = "#2196f3", aes(text = paste0("Team: ", team, "\nPayroll: $", payroll, "\nWin %: ", `WIN%`))) +
  scale_x_continuous(breaks = seq(90000000, 175000000, by = 20000000), limits = c(90000000, 175000000), labels = scales::comma) +
  scale_y_continuous(breaks = seq(0, 1, by = .25), limits = c(0, 1)) +
  labs(title = "Effect of Payroll on Win Percentage", x = "Payroll ($)", y = "Win %") +
  theme(plot.title = element_text(hjust = 0.5))
ggplotly(payrollPlot, tooltip = "text")
```
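A scatter plot shows the shape of the relationship, and a correlation coefficient can put a number on it. The sketch below is only an illustration: it uses the first five teams visible in the glimpse of the team table, which are also the five best records, so the result says little about the league as a whole.

```r
# First five teams from the glimpse of the team table (payroll in dollars).
payroll <- c(128858241, 171105334, 132022601, 134731235, 128963580)
win_pct <- c(0.808, 0.755, 0.673, 0.630, 0.623)

# Pearson measures linear association; Spearman is rank-based and
# less sensitive to a single very high payroll.
r_pearson  <- cor(payroll, win_pct)
r_spearman <- cor(payroll, win_pct, method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)
```

On the full table the same idea would be `cor(nba_team_stats$payroll, nba_team_stats[["WIN%"]], use = "complete.obs")`, assuming the payroll join matched every team name.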
### PPG vs. Wins
```{r ppg scatter}
ppg <- ggplot(nba_team_stats, aes(x = PTS, y = `WIN%`, label = team)) +
  geom_point(col = "#2196f3", aes(text = paste0("Team: ", team, "\nPoints Per Game: ", PTS, "\nWin %: ", `WIN%`))) +
scale_x_continuous(breaks = seq(100, 115, by=3), limits = c(100, 115)) +
scale_y_continuous(breaks = seq(0, 1, by=.25), limits = c(0,1)) +
labs(title="Effect of Points Per Game on Win Percentage", x="PPG", y="Win %") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(ppg, tooltip = "text")
```
### 3P% vs. Wins
```{r 3p scatter}
threePct <- ggplot(nba_team_stats, aes(x = `3P%`, y = `WIN%`, label = team)) +
  geom_point(col = "#2196f3", aes(text = paste0("Team: ", team, "\n3P%: ", `3P%`, "\n3PM: ", `3PM`, "\n3PA: ", `3PA`, "\nWin %: ", `WIN%`))) +
scale_x_continuous(breaks = seq(30, 40, by=2), limits = c(30, 40)) +
labs(title="Effect of 3 Point Percentage on Win Percentage", x="3P%", y="Win %") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(threePct, tooltip = "text")
```
Column
---
### Analysis
Advanced Stats
===
Column {.tabset data-width=650}
---
### Highest WS
```{r}
best50 <- nba_advanced %>%
arrange(desc(WS)) %>%
slice(1:50)
DT::datatable(best50[,1:28], rownames = FALSE,
options = list(columnDefs = list(list(className = 'dt-center', targets = 1:27))))
```
### TS% vs. PER
```{r}
# named tsPerPlot so the ppg plot object from the Team Analysis page is not overwritten
tsPerPlot <- ggplot(best50, aes(x = `TS%`, y = PER, label = Player)) +
  geom_point(col = "#2196f3", aes(text = paste0("Player: ", Player, "\nPosition: ", Pos, "\nTrue Shooting %: ", `TS%`, "\nPER: ", PER, "\nMPG: ", MPG))) +
  scale_x_continuous(breaks = seq(0.5, 0.75, by = .05), limits = c(0.5, 0.75)) +
  scale_y_continuous(breaks = seq(0, 35, by = 5), limits = c(0, 35)) +
  labs(title = "Relationship between TS% and PER", x = "TS%", y = "PER") +
  theme(plot.title = element_text(hjust = 0.5))
ggplotly(tsPerPlot, tooltip = "text")
```
### WS/48 vs. Minutes
Column
---
### Analysis
I restricted this section to the 50 players with the most win shares (WS) because WS ranks overall play and accounts for minutes played, which filters out players who rarely play.
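The link between WS and minutes can be checked directly. WS/48 is win shares per 48 minutes, so it should be close to `WS / MP * 48`. The sketch below recomputes it for the first three players shown in the glimpse of the advanced table; the column name `WS48_recomputed` is introduced here for illustration only.

```r
# First three players from the glimpse of the advanced table.
adv <- data.frame(
  Player = c("Precious Achiuwa", "Jaylen Adams", "Steven Adams"),
  MP     = c(408, 17, 760),            # total minutes played
  WS     = c(0.9, -0.1, 1.7),          # win shares, rounded to one decimal
  WS48   = c(0.101, -0.265, 0.109)     # WS/48 as reported in the dataset
)

# Win shares per 48 minutes, recomputed from the rounded WS column.
adv$WS48_recomputed <- round(adv$WS / adv$MP * 48, 3)
adv$difference <- abs(adv$WS48 - adv$WS48_recomputed)
adv
```

The recomputed values differ slightly from the reported ones because WS is rounded, and the gap is largest for the player with only 17 minutes. That is one reason low-minute players are noisy and were filtered out of this section.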
Conclusion
===
Column {data-width=650}
---
### Results
### Limitations
One major limitation of this study is that the best dataset I could find covers the 2020-21 NBA season. I could have used other NBA datasets, but this was the only one that included the advanced stats many analysts rely on today. While the data is recent enough to provide insight into today's NBA, a second limitation is that the dataset was created in the middle of the 2020-21 season, so the player statistics cover only part of a season.
### References
- https://www.kaggle.com/datasets/umutalpaydn/nba-20202021-season-player-stats
- https://www.kaggle.com/datasets/justinas/nba-players-data
- https://www.kaggle.com/datasets/loganlauton/nba-players-and-team-data?select=NBA+Payroll%281990-2023%29.csv
About the Author
===
Column {data-width=650}
---
### About Me
My name is Christopher Bussen and I am an undergraduate student at the University of Dayton. I am currently working towards my B.S. in Computer Science with minors in Mathematics and Data Analytics and am on track to graduate in May 2024.
After graduation, I am interested in pursuing full-time employment in a data analytics position, especially one that allows me to combine my love of sports and math.
I have exposure to Google Analytics, SPSS, SQL, Golang, Tableau, pandas, and Git, and I am proficient in Java, Python, R, HTML, CSS, and MS 365 applications.
Please connect with me on LinkedIn [here](https://www.linkedin.com/in/christopherbussen/).
Column {.tabset data-width=600}
---
### Picture
```{r , fig.width=6, echo=FALSE, fig.cap="Christopher Bussen", fig.align='center'}
knitr::include_graphics("headshot.jpeg")
```